Abstract In this paper, we address the problems of Arabic Text Classification and stemming using transducers and rational kernels. We introduce a new stemming technique based on Arabic patterns (Pattern Based Stemmer). Patterns are modelled using transducers, and stemming is done without depending on any dictionary. Using transducers for stemming, documents are transformed into finite state transducers. This document representation allows us to use and explore rational kernels as a framework for Arabic Text Classification. Stemming experiments are conducted on three word collections, and classification experiments are done on the Saudi Press Agency dataset. Results show that our approach, when compared with other approaches, is promising, especially in terms of accuracy, recall and F1.

Keywords N-gram • Arabic • Classification • Rational 
kernels • automata • Transducers 

1 Introduction 

Text Classification (TC) is the task of automatically sorting a set of documents into one or more categories from a predefined set [22]. Text classification techniques are used in many domains, including mail spam filtering, article indexing, Web searching, automated population of hierarchical catalogues of Web resources, and even automated essay grading.

Due to the complexity of the Arabic language, Arabic Text Classification (ATC) has started to receive great attention. Many algorithms have been developed to improve the performance of ATC systems [5, 6, 12, 13, 14, among others]. In general, we can divide an ATC system into three steps:

1. Preprocessing step: punctuation marks, diacritics, stop words and non-letters are removed.

2. Feature extraction: a set of features is extracted from the text, which will represent the text in the next step. For instance, Khreisat used the N-gram technique to extract features from documents. Another work [23] used stemming to extract features.

3. Learning step: many supervised algorithms have been used to train systems to classify Arabic text documents: Support Vector Machines [5, 13, 18], K-Nearest Neighbours, Naive Bayes and many others. Most algorithms rely on distance measures over the extracted features to decide how similar two documents are.

In the second step, a feature vector is constructed. Several stemming approaches have been developed. Khoja and Garside (1999) developed a dictionary-based stemmer. It performs well, but the dictionary needs to be maintained. The stemmer developed in [2] finds the three-letter roots of Arabic words without depending on any root dictionary or pattern files.

Many Arabic words have the same stem but not the same meaning. Reducing two semantically different words to the same root can induce classification errors. To prevent this, light stemming is used in TC algorithms [3]. Light-stemming algorithms make several passes over the text, attempting to locate and remove the most frequent prefixes and suffixes from the words. Because words are not reduced to their roots, this strategy yields a large number of features.

In the third step, many distance measures can be used to evaluate the distance (or dissimilarity) between documents using these feature vectors. The quality of the classification system depends on the distance measure used.

In this paper, we study the effect of stemming on ATC. Let us illustrate this with an example. Given two simple Arabic documents $d_1$ (in English: "The child learns and is brought up in the school") and $d_2$ (in English: "Schools provide education for our children"), we compute the Euclidean distance between them using 3-grams, with and without stemming:

                      Distance with 3-grams
Without stemming      0.25
With stemming         0.18

It is clear that the distance between $d_1$ and $d_2$ is affected by stemming.
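The comparison above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names, the per-word extraction of character 3-grams, and the normalization of counts to relative frequencies are all assumptions, since the text does not specify how the feature vectors were built. Transliterated toy strings stand in for the Arabic documents.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams, computed word by word (assumed scheme)."""
    grams = Counter()
    for word in text.split():
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams


def normalized(counts):
    """Scale raw counts to relative frequencies (an assumed normalization)."""
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def euclidean_distance(doc_a, doc_b, n=3):
    """Euclidean distance between the n-gram frequency vectors of two texts."""
    a = normalized(char_ngrams(doc_a, n))
    b = normalized(char_ngrams(doc_b, n))
    keys = set(a) | set(b)
    return sqrt(sum((a.get(g, 0.0) - b.get(g, 0.0)) ** 2 for g in keys))


# Two surface forms sharing a root are far apart before stemming ...
raw = euclidean_distance("madrasa", "madaris")
# ... and identical once both are reduced to the same (hypothetical) stem.
stemmed = euclidean_distance("drs", "drs")
print(raw > stemmed)
```

With stemming, the shared root contributes identical 3-grams to both vectors, which is why the distance shrinks.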

In this work, we enhance the stemming technique introduced by the authors in a previous paper [20]. The stemmer introduced in [20] gives a set of possible stems. Our new stemmer chooses the best stem based on a statistical study of character occurrences in a corpus of Arabic roots. A comparison experiment is conducted to assess its performance against standard stemmers. This stemming technique transforms documents into finite state transducers. Then, rational kernels [9] are used as a framework for ATC. This framework enables the use of different distance measures or kernels.

This paper is organized as follows. Section 2 presents, in more detail, the main stemming techniques. In Section 3, we recall some notions on weighted transducers and rational kernels. We present, in Section 4, our new stemming approach, then we explain how to use rational kernels as a framework for ATC. Experiments and results are reported and interpreted in Section 5.


2 Stemming Techniques 

In the context of ATC, stemming is applied to reduce the dimensionality of the feature vectors. Brute stemming (commonly called stemming) transforms each Arabic word in the document into its root. Light stemming, in contrast, reduces a word by removing only its prefixes and suffixes.


Brute Stemming 

There are many brute stemming techniques used in the context of ATC. They can be classified into two types: (i) stemming using a dictionary, where a dictionary of Arabic word stems is needed; (ii) stemming without a dictionary, where stems are extracted without depending on any root or pattern files.

The Khoja and Garside stemmer removes the longest suffix and the longest prefix. It then matches the remaining word with verb and noun patterns to extract the root by means of a dictionary. The stemmer makes use of many linguistic data files, such as lists of all diacritic characters, punctuation characters, definite articles and stop words. This stemmer performs well but relies on a dictionary which needs to be updated. The second technique [2] finds the three-letter roots of Arabic words without depending on any root or pattern files. It extracts word roots by assigning weights and ranks to the letters that constitute a word. Consonants are assigned a weight of zero, and different weights are assigned to the letters grouped in the word سألتمونيها, since all affixes are formed by combinations of these letters. The algorithm selects the letters with the lowest products (weight × rank) as root letters. Weights and ranks are assigned to letters using only a small amount of linguistic knowledge [2]. This algorithm, like any other brute stemming algorithm, gives the same stem for two semantically different words that share a root.
Light Stemming

In Arabic, some word variants do not have similar meanings (for example, the two words مكتبة, which means library, and كاتب, which means writer). However, these word variants give the same root if brute stemming is used. Thus, brute stemming can affect the meaning of words. Light stemming [3] aims to enhance text classification performance while retaining the meanings of words. Light-stemming algorithms make several passes over the text, attempting to locate and remove the most frequent prefixes and suffixes from each word. However, this leads to a large number of features.
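As a rough illustration of affix stripping, the following Python sketch removes at most one prefix and one suffix from a word. The affix lists, the minimum stem length and the function name are assumptions chosen for the example; they are not the lists used in [3].

```python
# Illustrative affix lists only (assumptions); real light stemmers use
# larger, carefully tuned inventories.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping >= min_len letters."""
    # Try longer affixes first so that "وال" wins over "و".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


# "the library" and "writer" share a root but keep distinct light stems.
print(light_stem("المكتبة"), light_stem("كاتب"))
```

Because the two words are not collapsed to a common root, they remain separate features, which preserves meaning at the cost of a larger feature space.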


3 Weighted Transducers and Rational Kernels 

Before describing our framework, we give some preliminaries on weighted transducers and rational kernels.

Transducers are finite automata in which each transition is augmented with an output label in addition to the familiar input label. Output labels are concatenated along a path to form an output sequence, as with input labels. Weighted transducers are finite-state transducers in which each transition carries some weight in addition to the input and output labels. The weight of a pair of input and output strings $(x, y)$ is obtained by summing the weights of the paths labelled with $(x, y)$. The following definition formalizes weighted transducers [9, 10].

Definition 1 A weighted finite-state transducer $T$ over the semiring $(\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})$ is an 8-tuple $T = (\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$, where $\Sigma$ is a finite input alphabet, $\Delta$ is a finite output alphabet, $Q$ is a finite set of states, $I \subseteq Q$ is the set of initial states, $F \subseteq Q$ is the set of final states, $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times \mathbb{K} \times Q$ is a finite set of transitions, $\lambda : I \to \mathbb{K}$ is the initial weight function, and $\rho : F \to \mathbb{K}$ is the final weight function.

For a path $\pi$ in a transducer, $p[\pi]$ denotes its origin state, $n[\pi]$ its destination state, and $w[\pi]$ its weight, obtained by $\otimes$-multiplying the weights of its arcs. The set of paths from the initial states $I$ to the final states $F$ labelled with input string $x$ and output string $y$ is denoted by $P(I, x, y, F)$. A transducer $T$ is regulated if the output weight associated by $T$ to any pair of input-output strings $(x, y)$, given by

$$[\![T]\!](x, y) = \bigoplus_{\pi \in P(I, x, y, F)} \lambda(p[\pi]) \otimes w[\pi] \otimes \rho(n[\pi]) \qquad (1)$$

is well-defined and in $\mathbb{K}$. By convention, $[\![T]\!](x, y) = \bar{0}$ if $P(I, x, y, F) = \emptyset$. Figure 2 shows an example of a simple transducer with an input string $x$ and an output string $y$. This transducer has a single accepting path, so the set $P(\{0\}, x, y, F)$ is a singleton.

Regulated weighted transducers are closed under the following operations, called rational operations:

- the sum (or union) of two weighted transducers $T_1$ and $T_2$ is defined by:

$$\forall (x, y) \in \Sigma^* \times \Delta^*, \quad [\![T_1 \oplus T_2]\!](x, y) = [\![T_1]\!](x, y) \oplus [\![T_2]\!](x, y) \qquad (2)$$

- the product (or concatenation) of two weighted transducers $T_1$ and $T_2$ is defined by:

$$\forall (x, y) \in \Sigma^* \times \Delta^*, \quad [\![T_1 \otimes T_2]\!](x, y) = \bigoplus_{x = x_1 x_2,\; y = y_1 y_2} [\![T_1]\!](x_1, y_1) \otimes [\![T_2]\!](x_2, y_2) \qquad (3)$$

- the composition of two weighted transducers $T_1$ and $T_2$ with matching input and output alphabets $\Sigma$ is a weighted transducer denoted by $T_1 \circ T_2$ when the sum

$$[\![T_1 \circ T_2]\!](x, y) = \bigoplus_{z \in \Sigma^*} [\![T_1]\!](x, z) \otimes [\![T_2]\!](z, y) \qquad (4)$$

is well-defined and in $\mathbb{K}$ for all $x, y \in \Sigma^*$.
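To make the three rational operations concrete, here is a small Python sketch. It is a simplification under stated assumptions: each transducer is represented only by its finite weight function, a dictionary from (input, output) string pairs to weights, and the semiring is the ordinary real semiring (+, ×). States and transitions are not modelled, and the function names are hypothetical; a library such as OpenFst works on the automata themselves.

```python
from collections import defaultdict


def t_sum(t1, t2):
    """Sum (union): weights of identical (x, y) pairs are added."""
    out = defaultdict(float)
    for t in (t1, t2):
        for pair, w in t.items():
            out[pair] += w
    return dict(out)


def t_product(t1, t2):
    """Product (concatenation): x = x1 x2, y = y1 y2, weights multiplied."""
    out = defaultdict(float)
    for (x1, y1), w1 in t1.items():
        for (x2, y2), w2 in t2.items():
            out[(x1 + x2, y1 + y2)] += w1 * w2
    return dict(out)


def t_compose(t1, t2):
    """Composition: sum over intermediate strings z of T1(x, z) * T2(z, y)."""
    out = defaultdict(float)
    for (x, z1), w1 in t1.items():
        for (z2, y), w2 in t2.items():
            if z1 == z2:
                out[(x, y)] += w1 * w2
    return dict(out)


a = {("a", "b"): 0.5, ("a", "c"): 0.5}
b = {("b", "d"): 1.0, ("c", "d"): 2.0}
print(t_compose(a, b))  # two intermediate strings contribute to ("a", "d")
```

In the composition example, both intermediate strings "b" and "c" lead from "a" to "d", so their path weights are summed, mirroring the sum over $z$ in the definition of composition.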

Rational kernels are a general family of kernels, based on weighted transducers, that extend kernel methods to the analysis of variable-length sequences or, more generally, weighted automata. Let $X$ and $Y$ be non-empty sets. A function $K : X \times Y \to \mathbb{R}$ is said to be a kernel over $X \times Y$. Cortes et al. [9] give a formal definition of rational kernels:

Definition 2 A kernel $K$ over $\Sigma^* \times \Delta^*$ is said to be rational if there exist a weighted transducer $T = (\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$ over the semiring $\mathbb{K}$ and a function $\psi : \mathbb{K} \to \mathbb{R}$ such that for all $x \in \Sigma^*$ and $y \in \Delta^*$:

$$K(x, y) = \psi([\![T]\!](x, y)) \qquad (5)$$

$K$ is then said to be defined by the pair $(\psi, T)$.

4 Framework for Arabic Stemming and Text 
Classification 

In the following we explain how to use transducers to do 
stemming. First, Arabic patterns, prefixes and suffixes 
are modelled by simple transducers, then, a stemming 
transducer is constructed using these simple ones by 
applying rational operations like concatenation, union 
and composition. Then, we show how to use rational 
kernels as a framework to do ATC. 


Stemming by Transducers

The Arabic language differs from other languages syntactically, morphologically and semantically. One of its main characteristic features is that most words are built up from roots by following certain fixed patterns and adding prefixes and suffixes. For instance, the Arabic word المدرسة (school) is built from the three-letter root or stem درس (learn) using the pattern مفعل; the prefix ال and the suffix ة (which is used to denote the feminine gender) are then added. This results in the measure المفعلة (see Table 1). Notice here that the letter ف denotes the first letter of the three-letter root, ع denotes the second letter and ل denotes the third one.

We use measures to construct a transducer that performs stemming. As an example, consider the transducer built for a single measure, denoted $T_{measure1}$. This transducer can be used to extract the three-letter root of any Arabic word matching this measure. This is achieved by the composition operation. We consider $T_{word}$, the transducer which maps the string $word$ to itself, i.e., its set of accepting paths $P(\{0\}, word, word, F)$ is a singleton (Figure 1 shows the transducer associated with the Arabic word for school).

The composition of two transducers is also a transducer:

$$(T_{word} \circ T_{measure1})(word, y) = \sum_{z} T_{word}(word, z) \cdot T_{measure1}(z, y)$$

Since the only string $z$ accepted by $T_{word}$ is $z = word$, we conclude that:

$$(T_{word} \circ T_{measure1})(word, y) = T_{word}(word, word) \cdot T_{measure1}(word, y)$$

As $T_{word}(word, word) = 1$, we obtain:

$$(T_{word} \circ T_{measure1})(word, y) = T_{measure1}(word, y)$$

If $word$ matches the measure, the output projection extracts the root (or stem) $y$ associated with $word$.
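The effect of composing a word with a single measure transducer can be imitated with plain string matching. The Python sketch below is an illustration only: the encoding of a measure as a string in which the digits 1, 2 and 3 mark the root-letter slots is an assumption made for this example (it avoids confusing the slot letter ل with the ل of the prefix ال), and `extract_root` is a hypothetical name.

```python
def extract_root(word, measure):
    """Return the root if `word` matches `measure`, otherwise None.

    In `measure`, the digits 1, 2 and 3 mark the root-letter slots
    (an assumed encoding); every other character must match literally.
    """
    if len(word) != len(measure):
        return None
    root = {}
    for letter, slot in zip(word, measure):
        if slot in "123":
            root[slot] = letter      # copy the word letter to the output
        elif letter != slot:
            return None              # affix or pattern letter mismatch
    return "".join(root[k] for k in sorted(root))


# prefix + pattern (with three root slots) + suffix
MEASURE = "ال" + "م" + "123" + "ة"

print(extract_root("المدرسة", MEASURE))  # word for "the school"
print(extract_root("كاتب", MEASURE))     # does not match this measure
```

A non-matching word yields `None`, which corresponds to an empty composition result: no path of the measure transducer accepts the word.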

In Arabic, there are 4 verb prefixes and 12 noun prefixes, in addition to the verb and noun suffixes. When diacritics are considered, there are more than 3000 patterns (to our knowledge). Since we do not consider diacritics in our approach, the number of patterns is much smaller (fewer than 200), and many of them are not used in Modern Standard Arabic. Indeed, several patterns that differ only in their diacritics collapse into a single pattern once diacritics are removed. For illustration, Tables 2 and 3 show some examples of noun and verb patterns.

We adopt the following process to construct the stemming transducer, which enables us to include all measures:

1. Build the transducer of all noun prefixes (resp. verb prefixes);

2. Build the transducer of all noun patterns (resp. verb patterns);

3. Build the transducer of all noun suffixes (resp. verb suffixes);

4. Concatenate the noun transducers (resp. verb transducers) obtained in steps 1, 2 and 3;

5. Sum the two transducers obtained in step 4.

The first and third steps are very simple. We construct a transducer for each prefix (resp. suffix), then we take the union of these transducers. The resulting transducer represents the prefixes (resp. suffixes) transducer (see Figure 3 and Figure 4). In the second step, we build all possible noun pattern transducers. The sum of these transducers represents the transducer of all noun patterns. We do the same to build the transducer of all verb patterns (Figure 5). In the fourth step, the transducers obtained in steps 1, 2 and 3 are concatenated. The final transducer is obtained by the union of the two transducers built in step 4.

The resulting transducer $T_{stemmer}$ cannot be represented graphically because of its large number of states (about 400). This transducer can stem any well-formed Arabic word, i.e., a word which matches some Arabic measure. In addition, it can give us semantic information about the stemmed word, which can be used to improve the quality of the classification system.

Transducers are created and manipulated using the OpenFst library [4], an open source library for constructing, combining, optimizing, and searching weighted finite-state transducers.


Weighting of Our Stemmer


The composition of $T_{stemmer}$ with any given word transducer $T_{word}$ gives a transducer which may include many paths, and hence many possible roots. Indeed, an Arabic word can match more than one measure at the same time. For example, an Arabic word meaning "win" matches at least two measures, each giving a different stem. Thus, the use of $T_{stemmer}$ leads to a set of one or more possible stems, and the correct stem belongs to this set. To cope with this situation, the stemming transducer must be weighted. Many schemes are possible. We use a bigram window probability technique to assign a score to a given stem. The technique is based on a statistical study of letter frequencies in a corpus of Arabic roots, which contains more than 10,000 three-letter roots. The score is assigned to a given stem by calculating the probability of letter occurrences in different positions. Let $s = c_1 c_2 c_3$ be a three-letter stem. $Score(s)$ is calculated by:

$$Score(s) = P_1(c_1, c_2) \times P_2(c_2, c_3)$$

where $P_1(c_1, c_2)$ is the probability of having the letter $c_2$ in the second position preceded by $c_1$, and $P_2(c_2, c_3)$ is the probability of having the letter $c_3$ in the third position preceded by $c_2$. We consider the correct stem to be the one with the best score:

$$best = \arg\max_{s \in \{\text{possible stems}\}} Score(s)$$
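The scoring rule can be sketched in Python as follows. This is a minimal illustration under assumptions: $P_1$ and $P_2$ are estimated as relative frequencies of letter pairs at positions (1, 2) and (2, 3) in a root list, which is one possible reading of the definition; the four-root toy corpus and the function names are hypothetical stand-ins for the real corpus of more than 10,000 roots.

```python
from collections import Counter


def train_bigram_model(roots):
    """Estimate P1 and P2 as relative frequencies of positional letter pairs."""
    n = len(roots)
    first = Counter(r[0] + r[1] for r in roots)    # letters 1 and 2
    second = Counter(r[1] + r[2] for r in roots)   # letters 2 and 3
    p1 = {pair: c / n for pair, c in first.items()}
    p2 = {pair: c / n for pair, c in second.items()}
    return p1, p2


def score(stem, p1, p2):
    """Score(s) = P1(c1, c2) * P2(c2, c3); unseen pairs get probability 0."""
    return p1.get(stem[:2], 0.0) * p2.get(stem[1:], 0.0)


def best_stem(candidates, p1, p2):
    """Pick the candidate stem with the highest score."""
    return max(candidates, key=lambda s: score(s, p1, p2))


toy_roots = ["كتب", "كتم", "كسب", "درس"]   # toy corpus (assumption)
p1, p2 = train_bigram_model(toy_roots)
print(best_stem(["اكت", "كتب"], p1, p2))
```

A real system would likely smooth the estimates so that a stem containing one unseen pair is not scored as impossible; the text does not say whether smoothing was used.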


Rational Kernels for Arabic Text Classification 
Our ATC system is divided into three stages:

1. Preprocessing step.

2. Feature extraction: the previous transducer is applied to each word of the document resulting from step 1. Then, the transducer resulting from the concatenation of these word-stem transducers represents the document in the next step.

3. Learning task: rational kernels are used to measure the distance between documents [9, 10], and SVM is used for classification.

Consider a set of documents, where each document consists of a sequence of words $w_1 w_2 \ldots$ Applying our stemming transducer to each word of a document and right-concatenating the results transforms this document into a finite state transducer. These transducers are packaged into an archive file (far) to be processed by the learning algorithm (Figure 6). OpenKernel, a library for creating, combining and using kernels for machine learning applications, is used to accelerate experiments.


5 Experimental Results and Discussion 

The following listing reports the main commands of the OpenFst and OpenKernel libraries used to implement our classification system.

1 fstcompose word.fst model.fst result.fst
2 fstconcat doc.fst result.fst doc.fst
3 farcreate data.list data.far
4 klngram --order=3 --sigma=29 data.far 3gram.kar
5 svm-train -k openkernel -K 2gram.kar cul.train cul.train.2gram.mdl
6 svm-predict cul.test cul.train.2gram.mdl cul.test.2gram.pred

To stem the words in a document, we iterate over these words using the OpenFst command [4] fstcompose (line 1), where word.fst is a linear finite state transducer with identical input and output labels, which represents a word, and model.fst is our weighted stemming transducer. The resulting transducer result.fst represents the best stem. Resulting transducers are right-concatenated to a finite state transducer (doc.fst) representing the entire document, using the OpenFst command fstconcat (line 2). The set of finite state transducers (FSTs) is then packaged in an FST archive (Far) using the farcreate command (line 3), where data.list contains the list of all FST documents, one file per line, and data.far is the FST archive (Far).

Various types of kernels can be created using the OpenKernel library. 3-gram kernels can be created using the command klngram (line 4), where the first argument --order specifies the size of the n-grams, and the second argument --sigma specifies the size of the alphabet, epsilon not included (the Arabic alphabet size is 28). The first parameter is the FST archive (data.far) and the second parameter (3gram.kar) is the resulting kernel archive.
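For plain strings, an n-gram kernel reduces to a dot product of n-gram count vectors. The Python sketch below illustrates that idea only; it is not the OpenKernel implementation. It counts n-grams of exactly length n, and whether klngram also includes shorter n-grams is not stated in the text, so that choice, the function names and the transliterated toy documents are assumptions.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count the contiguous n-grams of a sequence of symbols."""
    return Counter(tuple(sequence[i:i + n])
                   for i in range(len(sequence) - n + 1))


def ngram_kernel(x, y, n=3):
    """K(x, y) = sum over n-grams z of count_x(z) * count_y(z)."""
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    return sum(c * cy[g] for g, c in cx.items())


def gram_matrix(docs, n=3):
    """Kernel (Gram) matrix over a list of documents, as used by an SVM."""
    return [[ngram_kernel(a, b, n) for b in docs] for a in docs]


# Documents are given as concatenated (transliterated) stems.
docs = ["drsktb", "ktbdrs", "qwlfwz"]
for row in gram_matrix(docs):
    print(row)
```

When documents are sequences of three-letter stems, each stem is itself a 3-gram, so the 3-gram kernel directly rewards shared stems.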

The OpenKernel library includes a plugin for the LibSVM implementation [8]. This enables us to do training, predicting and scoring on our dataset. The training command creates a model on the training set (line 5), where the first argument -k specifies the kernel format and the second one (-K) specifies the n-gram kernel archive. The first parameter specifies a correctly classified subset of the training set, and the second parameter is the resulting model. In this command, cul.train contains a labelled subset of training documents belonging to the Cultural class. Having a model, we can use it to classify documents of the testing dataset with the command svm-predict (line 6), where the first parameter specifies a correctly classified subset of the testing set and the second parameter is the model resulting from the previous command. The last parameter is the file that receives the predictions made using the model.
Stemming Results

To check the performance of our stemmer, experiments were performed on three word collections. The first one (Gold1) is a sample taken from the Corpus of Contemporary Arabic [21]. The two others (Gold2 and Gold3) are sets built in-house. All words of these sets were annotated by hand with the correct root. Roots have been checked by scholars who are experts in the Arabic language. The three sets are picked randomly from different topics, including politics, culture, sport and news. Table 4 gives an overview of these three collections, with the number of words (# words) in each gold standard. Table 5 reports the accuracy of our stemmer on the three sets of words.

Experimental results show the effectiveness of our stemming approach. Results on the different corpora are stable, and the best score is achieved on the largest corpus (Gold3). The results of our stemmer fall between those of the Khoja and Al-Serhan stemmers. This can be explained by the fact that Khoja's stemmer is a dictionary-based tool, which makes it language dependent. The Al-Serhan stemmer is an unsupervised one that uses only a small amount of information about the language. Our stemmer is a semi-supervised tool. It uses linguistic knowledge (patterns), but only in the construction stage. Patterns are fixed and do not change.


ATC Results 

We perform experiments on the Saudi Press Agency (SPA) dataset [6] for training and testing the ATC system. As detailed in Table 6, this dataset contains 1,526 text documents, each belonging to one of six categories: culture, economics, social, general, politics and sport. As mentioned before, stop words, non-Arabic letters, symbols and digits were removed. We used 80% of the documents for training the classifier and 20% for testing. Learning is done using the LibSVM implementation [8] included in OpenKernel, with three different n-gram kernels (n = 2, 3, 4). Since we want to show the effect of stemming, we report the results of three classifier versions: without stemming (Classifier 1), with the Al-Serhan stemmer (Classifier 2) and with our stemmer (Classifier 3), in terms of accuracy, precision, recall and F1. In Figures 7, 9 and 11, we report results in terms of accuracy and precision for the three classifiers with the three kernels (bigrams, 3-grams and 4-grams). Figures 8, 10 and 12 give results in terms of recall and F1 for the same classifiers.
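For reference, the four measures can be computed per category as in the Python sketch below. It assumes a one-vs-rest evaluation, in which one category is treated as the positive class and all others as negative; the text does not state the exact protocol, and the function name and toy labels are hypothetical.

```python
def one_vs_rest_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one category against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (assumption): one hit, one miss, one false alarm for "sport".
truth = ["sport", "sport", "culture", "general"]
predicted = ["sport", "culture", "culture", "sport"]
print(one_vs_rest_metrics(truth, predicted, "sport"))
```

Precision penalizes false alarms while recall penalizes misses, which is why a change such as stemming can raise one and lower the other.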

Concerning the quality of classification, Figure 13 shows that the best results were reached with the 3-gram kernel for the accuracy, recall and F1 measures. This can be explained by the fact that more than 80% of Arabic words are built from 3-letter roots.

For the 3-gram kernel, let us measure the effect of stemming on classification. For most classes, stemming improves results in terms of accuracy, recall and F1 (see Figures 9 and 10). However, stemming negatively affects precision (see Figure 9).

The best scores, observed for the Sport class, can be explained by the fact that it uses a specific vocabulary. Poor results are reported for the General class. This is expected, given that the words used in this kind of document are generic. Finally, our classifier surpasses the other classifiers in most cases.


6 Conclusion 

In this paper, we introduced a new framework for Arabic word stemming and text classification. It is based on the use of transducers for stemming and rational kernels for measuring the distance between documents. First, our stemmer uses transducers for modelling Arabic patterns. Second, rational kernels are used to measure distances between documents. Experiments and analysis of this framework in the context of Arabic Text Classification show that stemming improves the quality of classifiers in terms of accuracy, recall and F1, but slightly decreases precision. Classifiers based on 3-grams reached the best results. Like that of Al-Serhan, our stemming approach does not rely on a dictionary, and it gives better results.

In future work, other kernels, such as word-grams and gappy grams, will be investigated.
Fig. 1 Transducer corresponding to the Arabic word for "school".

Fig. 2 Example of a transducer.

Fig. 3 Transducer of noun prefixes (top) and verb prefixes (bottom).

Fig. 4 Transducer of noun and verb suffixes.
Table 1 Measures for the 3-letter root and built words (rows: Measures, Words).

Table 2 Examples of noun patterns (columns: 3-letters, 4-letters, 5-letters, 6-letters, 7-letters).

Table 3 Examples of verb patterns (columns: 3-letters, 4-letters, 3-letters +1, 3-letters +2, 3-letters +3, 4-letters +1).

Table 4 Gold Standards details.

Corpus    # words
Gold1     679
Gold2     844
Gold3     1,000



Table 5 Accuracy of Stemmers.

Corpus     Khoja Stemmer (%)    Our Stemmer (%)    Al-Serhan Stemmer (%)
Gold1      82.77                71.68              51.40
Gold2      85.55                74.82              49.64
Gold3      87.60                80.30              56.40
Average    85.30                75.60              52.48


Table 6 SPA corpus details.

Categories    Training texts    Testing texts    Total
Culture       201               57               258
Economics     200               50               250
Social        203               55               258
Politics      200               50               250
General       205               50               255
Sports        205               50               255
Total         1,214             312              1,526
Fig. 5 Transducer of verb patterns.

Fig. 6 Transformation of a text document into a finite state transducer.

Fig. 7 Accuracy and Precision of SVM Classification using 2-gram Kernel.

Fig. 8 Recall and F1 of SVM Classification using 2-gram Kernel.

Fig. 9 Accuracy and Precision of SVM Classification using 3-gram Kernel.

Fig. 10 Recall and F1 of SVM Classification using 3-gram Kernel.

Fig. 11 Accuracy and Precision of SVM Classification using 4-gram Kernel.

Fig. 12 Recall and F1 of SVM Classification using 4-gram Kernel.

In Figs. 7 to 12, bars are grouped by category (Cultural, Economics, General, Politics, Social, Sport, Average) for Classifier 1, Classifier 2 and Classifier 3.

Fig. 13 Accuracy and Precision averages using Bigram, 3-gram and 4-gram Kernels.